To build a predictive model for attrition that provides useful input into an ROI calculation for retention intervention(s).
Using the provided data, which includes variables such as employee satisfaction level, last evaluation score, number of projects, average monthly hours, tenure, promotions, and salary, we sought a model that could accurately predict which employees may leave the company. Our goal was to determine which predictors provide the most signal that an employee intends to leave. With this information, we may be able to intervene and prevent the employee from leaving, thereby saving the company money.
See the section called Business Value: Visualizing Model Performance at the bottom of this document for visualizations of the ROI discussed here. (Unfortunately, package bugs prevented those visualizations from rendering; they may be added in a future revision.)
The proposed model is a Random Forest. A Random Forest is an ensemble of decision trees: each tree is grown on a bootstrapped sample of the data using a random subset of the predictor variables, and the trees vote to produce the prediction for the outcome variable.
The variable importance statistics for the RF model are as follows:
| Predictor | 0 | 1 | MeanDecreaseAccuracy | MeanDecreaseGini |
|---|---|---|---|---|
| satisfaction_level | 101.400344 | 337.898530 | 319.888350 | 1124.759934 |
| last_evaluation | 18.640044 | 111.035368 | 113.710098 | 302.200133 |
| number_project | 69.694218 | 239.160597 | 235.333583 | 445.560219 |
| average_monthly_hours | 62.586243 | 101.416338 | 111.910062 | 328.637547 |
| time_spend_company | 57.939171 | 82.418274 | 91.668575 | 459.621633 |
| Work_accident | 4.319829 | 8.684521 | 9.092943 | 4.591932 |
| promotion_last_5years | 2.830797 | 5.160545 | 5.064632 | 1.076596 |
| department | 7.580276 | 48.293374 | 29.060656 | 29.775646 |
| salary | 7.210778 | 27.537577 | 19.853285 | 13.094145 |
Examining these stats, we see that the satisfaction_level, last_evaluation, and number_project variables give the most signal (i.e., have the most influence) as factors indicating an employee may leave the company. Note that importance measures strength of influence, not direction; for instance, those who left showed lower satisfaction levels on average. The proposed model has an accuracy of 98.4% on the test dataset.
It is important to note that this model is valid for the test data set but may not generalize well to future data. The proposed model should be well validated against new datasets before proceeding. That said, the Random Forest approach should be more generalizable than a standard Decision Tree model, since averaging over many bootstrapped trees reduces the variance of any single tree.
Confidence was checked through 10-fold cross-validation on the data set. To gain further confidence, future analysts may wish to increase the number of bootstrap samples they test, both to strengthen confidence in the overall model and to gain more signal on specific features (i.e., predictor variables) of the model.
Unfortunately, Random Forests can be difficult to interpret, so there is not a direct intervention that can be proposed based on this model alone. However, if we focus on the leading variable, satisfaction level, and run an intervention to improve satisfaction, we can estimate how much money may be saved.
For this objective, it has been determined that the loss of an employee costs the company 100,000 dollars. We are recommending an intervention program with a cost of 5,000 dollars per employee. Based on our research and data, we expect the intervention program to work approximately 50% of the time. With these costs in mind, it is understood that the company will save 95,000 dollars per employee who does not leave as a result of the intervention.
In the final running of the model, it was accurately predicted that 1693 people would leave the company. Assuming we can save half of those people with our intervention (n=847), the company will have saved 80.4 M dollars, minus the cost of the intervention for those who weren’t saved (4.2M), a total of 76.2M dollars.
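The savings arithmetic above can be reproduced in a few lines of base R (the variable names here are illustrative; the figures come from the model output and the cost assumptions stated in this document):

```r
cost_per_departure <- 100000 # cost of losing one employee
intervention_cost  <- 5000   # intervention cost per employee
success_rate       <- 0.5    # expected intervention success rate
predicted_leavers  <- 1693   # employees the model predicts will leave

saved  <- predicted_leavers * success_rate                 # ~846.5 employees retained
gross  <- saved * (cost_per_departure - intervention_cost) # 95,000 saved per retained employee
wasted <- (predicted_leavers - saved) * intervention_cost  # spend on employees who leave anyway
net    <- gross - wasted
net # ~76.2 million dollars
```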
With 98.8% accuracy in the final model (1 minus the 1.19% OOB error rate reported below), we can be reasonably confident in these savings estimates.
Confusion matrix for rf_final (OOB estimate of error rate: 1.19%):

| | 0 | 1 | class.error |
|---|---|---|---|
| 0 | 5717 | 9 | 0.001571778 |
| 1 | 80 | 1693 | 0.045121263 |
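As a quick sanity check, the quoted error rate and accuracy follow directly from the counts in the confusion matrix above:

```r
# Rows are actual classes, columns are predicted classes (counts from the OOB confusion matrix)
cm <- matrix(c(5717, 80, 9, 1693), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

oob_error    <- (cm["0", "1"] + cm["1", "0"]) / sum(cm) # misclassified / total
oob_accuracy <- 1 - oob_error
round(oob_error, 4)    # 0.0119
round(oob_accuracy, 3) # 0.988
```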
The following code was used to run and interpret 4 models to predict attrition using the provided dataset. The first sections include exploratory data analysis and visualizations to understand the context of the data. The data was relatively “clean” with no missing values or duplicated observations. Models were developed using tidy modeling principles, including workflows and recipes. The recipes include preprocessing steps to normalize and balance the data.
A table of model results and decisions is below (more details can be found in the model_comp table). The approach attempted several models with varying levels of complexity. The simpler models did not perform well and did not predict above chance. The more complex models perform very well on the available dataset but are difficult to interpret for the business.
CAVEAT: The proposed model should be well validated against new datasets before proceeding. The Random Forest approach should be more generalizable than a standard Decision Tree.
| Model | Accuracy | Decision |
|---|---|---|
| Logistic Regression | 0.7473040 | Low accuracy |
| Support Vector Class | 0.7555702 | Low accuracy |
| Neural Net | 0.9361251 | High accuracy, unable to interpret results |
| Random Forest | 0.9846636 | High accuracy, unable to interpret results |
library(readxl)
Data <- read_excel("C:/Users/Jaclyn/Desktop/GithubClone/psyc6841_finalproject/01_data/PSYC6841_Final_Project_Data.xlsx")
str(Data)
tibble [9,999 x 10] (S3: tbl_df/tbl/data.frame)
$ satisfaction_level : num [1:9999] 0.1 1 0.69 0.63 0.88 0.8 0.11 0.85 0.37 0.99 ...
$ last_evaluation : num [1:9999] 0.91 0.85 0.62 0.93 0.65 0.76 0.8 0.67 0.71 0.54 ...
$ number_project : num [1:9999] 6 4 4 4 5 4 6 4 6 4 ...
$ average_monthly_hours: num [1:9999] 255 156 183 238 228 135 256 228 266 239 ...
$ time_spend_company : num [1:9999] 4 3 4 2 3 2 4 4 5 3 ...
$ Work_accident : num [1:9999] 0 0 0 0 0 0 0 0 0 0 ...
$ left : num [1:9999] 1 0 0 0 0 0 1 0 0 0 ...
$ promotion_last_5years: num [1:9999] 0 0 0 0 0 0 0 0 0 0 ...
$ sales : chr [1:9999] "sales" "sales" "support" "technical" ...
$ salary : chr [1:9999] "medium" "low" "low" "low" ...
skim(Data)
-- Data Summary ------------------------
Values
Name Data
Number of rows 9999
Number of columns 10
_______________________
Column type frequency:
character 2
numeric 8
________________________
Group variables None
-- Variable type: character -------------------------------------------------------------------------------
# A tibble: 2 x 8
skim_variable n_missing complete_rate min max empty n_unique whitespace
* <chr> <int> <dbl> <int> <int> <int> <int> <int>
1 sales 0 1 2 11 0 10 0
2 salary 0 1 3 6 0 3 0
-- Variable type: numeric ---------------------------------------------------------------------------------
# A tibble: 8 x 11
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
* <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 satisfaction_level 0 1 0.611 0.250 0.09 0.44 0.64 0.82 1 ▃▅▇▇▇
2 last_evaluation 0 1 0.717 0.171 0.36 0.56 0.72 0.87 1 ▂▇▆▇▇
3 number_project 0 1 3.81 1.24 2 3 4 5 7 ▇▅▃▂▁
4 average_monthly_hours 0 1 201. 50.0 96 156 199 245 310 ▃▇▆▇▂
5 time_spend_company 0 1 3.52 1.49 2 3 3 4 10 ▇▃▁▁▁
6 Work_accident 0 1 0.147 0.354 0 0 0 0 1 ▇▁▁▁▂
7 left 0 1 0.236 0.425 0 0 0 0 1 ▇▁▁▁▂
8 promotion_last_5years 0 1 0.0208 0.143 0 0 0 0 1 ▇▁▁▁▁
The dataset has 9,999 rows and 10 columns: 2 character columns and 8 numeric columns, with no missing values (n_missing) in any column of the data frame.
glimpse(Data)
Rows: 9,999
Columns: 10
$ satisfaction_level <dbl> 0.10, 1.00, 0.69, 0.63, 0.88, 0.80, 0.11, 0.85, 0.37, 0.99, 0.66, 0.80, 0.6~
$ last_evaluation <dbl> 0.91, 0.85, 0.62, 0.93, 0.65, 0.76, 0.80, 0.67, 0.71, 0.54, 0.95, 0.82, 0.7~
$ number_project <dbl> 6, 4, 4, 4, 5, 4, 6, 4, 6, 4, 5, 3, 5, 4, 3, 2, 3, 3, 3, 3, 4, 4, 3, 2, 4, ~
$ average_monthly_hours <dbl> 255, 156, 183, 238, 228, 135, 256, 228, 266, 239, 206, 218, 270, 238, 254, ~
$ time_spend_company <dbl> 4, 3, 4, 2, 3, 2, 4, 4, 5, 3, 2, 3, 3, 10, 2, 3, 3, 5, 3, 2, 2, 4, 3, 3, 4,~
$ Work_accident <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, ~
$ left <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, ~
$ promotion_last_5years <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
$ sales <chr> "sales", "sales", "support", "technical", "sales", "technical", "IT", "supp~
$ salary <chr> "medium", "low", "low", "low", "medium", "low", "low", "low", "low", "low",~
colnames(Data)
[1] "satisfaction_level" "last_evaluation" "number_project" "average_monthly_hours"
[5] "time_spend_company" "Work_accident" "left" "promotion_last_5years"
[9] "sales" "salary"
#VIEWING CHARACTER DATA VALUES
Data %>%
select_if(is.character) %>%
map(unique) #from purrr -- shows us all of the values available for our character data
select_if: dropped 8 variables (satisfaction_level, last_evaluation, number_project, average_monthly_hours, time_spend_company, …)
$sales
[1] "sales" "support" "technical" "IT" "marketing" "product_mng" "accounting"
[8] "management" "RandD" "hr"
$salary
[1] "medium" "low" "high"
The sales column appears to be mislabelled. We’ll rename it “Department” based on the values we see in that column.
#RENAME 'SALES' COLUMN TO `DEPARTMENT`
Data <- rename(Data, department = sales)
rename: renamed one variable (department)
#DOUBLE CHECK FACTOR
glimpse(Data)
Rows: 9,999
Columns: 10
$ satisfaction_level <dbl> 0.10, 1.00, 0.69, 0.63, 0.88, 0.80, 0.11, 0.85, 0.37, 0.99, 0.66, 0.80, 0.6~
$ last_evaluation <dbl> 0.91, 0.85, 0.62, 0.93, 0.65, 0.76, 0.80, 0.67, 0.71, 0.54, 0.95, 0.82, 0.7~
$ number_project <dbl> 6, 4, 4, 4, 5, 4, 6, 4, 6, 4, 5, 3, 5, 4, 3, 2, 3, 3, 3, 3, 4, 4, 3, 2, 4, ~
$ average_monthly_hours <dbl> 255, 156, 183, 238, 228, 135, 256, 228, 266, 239, 206, 218, 270, 238, 254, ~
$ time_spend_company <dbl> 4, 3, 4, 2, 3, 2, 4, 4, 5, 3, 2, 3, 3, 10, 2, 3, 3, 5, 3, 2, 2, 4, 3, 3, 4,~
$ Work_accident <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, ~
$ left <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, ~
$ promotion_last_5years <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
$ department <chr> "sales", "sales", "support", "technical", "sales", "technical", "IT", "supp~
$ salary <chr> "medium", "low", "low", "low", "medium", "low", "low", "low", "low", "low",~
#TABLE OF NUMBER UNIQUE VALUES FOR NUMERIC VARIABLES
Data %>%
select_if(is.numeric) %>%
map_df(~ unique(.) %>% length()) %>% # tries to turn it into a df instead of a list
gather() %>%
arrange(desc(value))
select_if: dropped 2 variables (department, salary)
gather: reorganized (satisfaction_level, last_evaluation, number_project, average_monthly_hours, time_spend_company, …) into (key, value) [was 1x8, now 8x2]
#VIEW DUMMY CODED VARIABLES
Data %>%
distinct(Work_accident, left, promotion_last_5years)
distinct: removed 9,991 rows (>99%), 8 rows remaining
#VIEW LEVELS OF SALARY
Data %>%
distinct(salary)
distinct: removed 9,996 rows (>99%), 3 rows remaining
The salary data needs to be turned into an ordered factor. We’ll order the levels from Low to High.
#CHANGE SALARY TO ORDERED FACTOR
Data <- Data %>%
mutate(salary = factor(salary,
levels = c("low",
"medium",
"high")))
mutate: converted 'salary' from character to factor (0 new NA)
Three of our numeric variables appear to be dummy coded (0=No/1=Yes), including Work_accident, left, promotion_last_5years.
One of these variables, left, is our outcome variable, indicating whether someone stayed at or left the company.
Three of our numeric variables appear to be continuous, including average_monthly_hours, satisfaction_level, and last_evaluation.
Two numeric variables are discrete, including only 6 (number_project) or 8 (time_spend_company) values each.
#CHECK FOR MISSING DATA
library(Amelia)
Warning: package ‘Amelia’ was built under R version 4.0.5
Loading required package: Rcpp
Warning: package ‘Rcpp’ was built under R version 4.0.5
Attaching package: ‘Rcpp’
The following object is masked from ‘package:rsample’:
populate
##
## Amelia II: Multiple Imputation
## (Version 1.7.6, built: 2019-11-24)
## Copyright (C) 2005-2021 James Honaker, Gary King and Matthew Blackwell
## Refer to http://gking.harvard.edu/amelia/ for more information
##
missmap(Data, y.at=c(1), y.labels=c(''), col=c('orange', 'purple'))
Warning: Unknown or uninitialised column: `arguments`.
Warning: Unknown or uninitialised column: `arguments`.
Wow, no missing values! We will consider our dataset complete and do not need to impute any missing data.
We’ll also check if there are any duplicate lines we need to be aware of.
#CHECK FOR ROWS WITH EXACT SAME VALUES
sum(duplicated(Data)) #duplicated() returns a logical per row; sum() counts the duplicates
[1] 0
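A note on `duplicated()`: it returns a logical vector flagging rows that repeat an earlier row, so summing it directly counts the duplicates; wrapping it in `is.na()` always yields zero because the result contains no NAs. A tiny illustration:

```r
df <- data.frame(x = c(1, 1, 2), y = c("a", "a", "b"))
duplicated(df)             # FALSE TRUE FALSE; row 2 repeats row 1
sum(duplicated(df))        # 1 duplicated row
sum(is.na(duplicated(df))) # always 0, regardless of duplicates
```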
#IS OUR ID COLUMN KOSHER?
which(duplicated(Data$ID))
Warning: Unknown or uninitialised column: `ID`.
integer(0)
There do not appear to be any duplicate rows. (The ID check above produced a warning because this dataset has no ID column.)
psych::describe(Data)
At first glance, the variables appear as expected. Points of interest:

* Large, positive skew in time_spend_company, one of our discrete variables
* Large SD in number_project and average_monthly_hours
* Large range in average_monthly_hours
# ATTRITION PERCENTAGE IN DATASET
Data %>%
tabyl(left) %>%
adorn_pct_formatting(digits = 0, affix_sign = TRUE)
left n percent
0 7635 76%
1 2364 24%
Taking a look at our outcome variable, we find that our data is unbalanced: about one quarter of the observations left the organization and three quarters stayed. We'll need to account for this in our recipes.
#SEVERAL TABLES TO LOOK AT VARIABLES IN ISOLATION
# CONTINUOUS VARIABLES
# tabyl(Data, satisfaction_level)
# tabyl(Data, last_evaluation)
# tabyl(Data, average_monthly_hours)
#DISCRETE & CATEGORICAL VARIABLES
tabyl(Data, number_project) %>%
adorn_pct_formatting(digits = 0, affix_sign = TRUE)
number_project n percent
2 1593 16%
3 2741 27%
4 2837 28%
5 1853 19%
6 796 8%
7 179 2%
tabyl(Data, time_spend_company) %>%
adorn_pct_formatting(digits = 0, affix_sign = TRUE)
time_spend_company n percent
2 2158 22%
3 4263 43%
4 1702 17%
5 992 10%
6 485 5%
7 119 1%
8 124 1%
10 156 2%
tabyl(Data, Work_accident) %>%
adorn_pct_formatting(digits = 0, affix_sign = TRUE)
Work_accident n percent
0 8528 85%
1 1471 15%
tabyl(Data, promotion_last_5years) %>%
adorn_pct_formatting(digits = 0, affix_sign = TRUE)
promotion_last_5years n percent
0 9791 98%
1 208 2%
tabyl(Data, department) %>%
adorn_pct_formatting(digits = 0, affix_sign = TRUE)
department n percent
accounting 516 5%
hr 493 5%
IT 816 8%
management 426 4%
marketing 599 6%
product_mng 589 6%
RandD 524 5%
sales 2754 28%
support 1499 15%
technical 1783 18%
tabyl(Data, salary) %>%
adorn_pct_formatting(digits = 0, affix_sign = TRUE)
salary n percent
low 4866 49%
medium 4301 43%
high 832 8%
Checking the balance of predictor variables, we see:
* time_spend_company, Work_accident, and promotion_last_5years are all unevenly distributed
* department distribution is uneven, with 28% of observations showing in the Sales department. Support and Technical departments closely follow. Other departments are only 5% or 6% of the data.

Data %>%
introduce() %>%
pivot_longer(everything())
pivot_longer: reorganized (rows, columns, discrete_columns, continuous_columns, all_missing_columns, …) into (name, value) [was 1x9, now 9x2]
plot_intro(Data)
#library(funModeling)
#FACTOR VARIABLE FREQUENCY
freq(Data)
Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
[1] "Variables processed: department, salary"
skim(Data) %>%
filter(skim_type == 'numeric')
filter: removed 2 rows (20%), 8 rows remaining
-- Data Summary ------------------------
Values
Name Data
Number of rows 9999
Number of columns 10
_______________________
Column type frequency:
numeric 8
________________________
Group variables None
-- Variable type: numeric ---------------------------------------------------------------------------------
# A tibble: 8 x 11
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
* <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 satisfaction_level 0 1 0.611 0.250 0.09 0.44 0.64 0.82 1 ▃▅▇▇▇
2 last_evaluation 0 1 0.717 0.171 0.36 0.56 0.72 0.87 1 ▂▇▆▇▇
3 number_project 0 1 3.81 1.24 2 3 4 5 7 ▇▅▃▂▁
4 average_monthly_hours 0 1 201. 50.0 96 156 199 245 310 ▃▇▆▇▂
5 time_spend_company 0 1 3.52 1.49 2 3 3 4 10 ▇▃▁▁▁
6 Work_accident 0 1 0.147 0.354 0 0 0 0 1 ▇▁▁▁▂
7 left 0 1 0.236 0.425 0 0 0 0 1 ▇▁▁▁▂
8 promotion_last_5years 0 1 0.0208 0.143 0 0 0 0 1 ▇▁▁▁▁
#plot_num(Data)
# Let's look at this without our dummy coded data
Data %>%
select("satisfaction_level",
"last_evaluation",
"number_project",
"average_monthly_hours",
"time_spend_company") %>%
plot_num()
Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
Our numeric variables are not very normally distributed. Average monthly hours and last evaluation appear bimodal. The remaining variables are somewhat skewed.
# DUMMY CODED
Data %>%
select("Work_accident",
"left",
"promotion_last_5years") %>%
plot_num()
Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
Our category variables are unbalanced.
## Left: frequency distribution of all discrete variables
plot_bar(Data)
## Right: `left` distribution of all discrete variables
plot_bar(Data, with = "left")
Some of the departments with less representation have more attrition, such as HR and RandD.

plot_bar(Data, by = "left")
plot_bar(Data, by = "salary")
Data %>%
select("satisfaction_level",
"last_evaluation",
"number_project",
"average_monthly_hours",
"time_spend_company") %>%
plot_histogram()
select: dropped 5 variables (Work_accident, left, promotion_last_5years, department, salary)
Data %>%
select("average_monthly_hours",
"last_evaluation",
"satisfaction_level") %>%
plot_density()
select: dropped 7 variables (number_project, time_spend_company, Work_accident, left, promotion_last_5years, …)
#QQ PLOTS
Data %>%
select("average_monthly_hours",
"last_evaluation",
"satisfaction_level") %>%
plot_qq()
select: dropped 7 variables (number_project, time_spend_company, Work_accident, left, promotion_last_5years, …)
left <- Data$left #Create variable
Data %>%
plot_prcomp(maxcat = 5L)
1 features with more than 5 categories ignored!
department: 10 categories
var(left)
[1] 0.1805456
As we see in the Data:
Observations: 9,999 with Variables: 10
There is no missing data or duplicate data.
The Outcome variable is left with 7635 (76%) ‘No’ and 2364 (24%) ‘Yes’. This variable indicates if an employee left the company.
Three of our variables are dummy coded (0=No/1=Yes), including Work_accident, left, promotion_last_5years. One of these variables, left, is our outcome variable, indicating whether someone stayed at or left the company.
Three of our numeric variables appear to be continuous, including average_monthly_hours, satisfaction_level, and last_evaluation.
Two numeric variables are discrete, including only 6 (number_project) and 8 (time_spend_company) values each.
Demographic data is limited. Variables such as department, salary level, and time_spend_company (tenure in years) are the closest thing to demographic information we have.
Some variables provide personal information, such as satisfaction_level and Work_accident. And a few indicate overall performance for the employee, such as last_evaluation and promotion_last_5years.
We’ll need to account for unbalance in the dataset in preprocessing. We’ll also need to normalize (center and scale) numeric predictors.
library(corrr)
Warning: package ‘corrr’ was built under R version 4.0.5
Data %>% select(where(is.numeric)) %>%
correlate()
select: dropped 2 variables (department, salary)
Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'
#CREATE OBJECT OF CORRELATION MATRIX
cor_matrix <- Data %>%
select(where(is.numeric)) %>%
correlate()
select: dropped 2 variables (department, salary)
Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'
#REARRANGE SHOULD GROUP ANY HIGHLY CORRELATED COLUMNS
cor_matrix %>%
rearrange()
Registered S3 methods overwritten by 'registry':
method from
print.registry_field proxy
print.registry_entry proxy
cor_matrix %>%
corrr::focus(left) %>%
arrange(desc(left))
cor_matrix %>%
corrr::focus(left, number_project, satisfaction_level) %>%
arrange(desc(left))
cor_matrix %>%
stretch(na.rm = TRUE, remove.dups = TRUE)
#To create a structured version of our correlation data, which we can pass to ggplot or dplyr for further data analysis, we can use the stretch() function. https://www.gmudatamining.com/lesson-07-r-tutorial.html
# histogram of the correlation values observed for all unique pairs of variables in cor_matrix
cor_matrix %>%
stretch(na.rm = TRUE, remove.dups = TRUE) %>%
ggplot(mapping = aes(x = r)) +
geom_histogram(fill = '#006EA1', color = 'white', bins = 15) +
labs(title = 'Correlation Values',
x = 'Correlation Value',
y = 'Count')
cor_matrix2 <- Data %>%
select(where(is.numeric)) %>%
cor()
select: dropped 2 variables (department, salary)
cor_matrix2
satisfaction_level last_evaluation number_project average_monthly_hours
satisfaction_level 1.00000000 0.110623371 -0.154398550 -0.029060284
last_evaluation 0.11062337 1.000000000 0.352631362 0.349228288
number_project -0.15439855 0.352631362 1.000000000 0.416829145
average_monthly_hours -0.02906028 0.349228288 0.416829145 1.000000000
time_spend_company -0.10824181 0.133035838 0.195597643 0.127196959
Work_accident 0.06665317 -0.013389981 -0.009984353 -0.014007727
left -0.38759296 0.002576301 0.029574973 0.068494570
promotion_last_5years 0.02169405 -0.014264824 -0.004259769 -0.003583158
time_spend_company Work_accident left promotion_last_5years
satisfaction_level -0.108241811 0.066653172 -0.387592956 0.021694046
last_evaluation 0.133035838 -0.013389981 0.002576301 -0.014264824
number_project 0.195597643 -0.009984353 0.029574973 -0.004259769
average_monthly_hours 0.127196959 -0.014007727 0.068494570 -0.003583158
time_spend_company 1.000000000 0.008205063 0.131600744 0.065150746
Work_accident 0.008205063 1.000000000 -0.160663519 0.046291356
left 0.131600744 -0.160663519 1.000000000 -0.061312236
promotion_last_5years 0.065150746 0.046291356 -0.061312236 1.000000000
corrplot(cor_matrix2,
method = "ellipse", # ellipses instead of circles
type = "upper", # keep upper half only
order = "FPC", # order by loadings on FPC
tl.col = "black") # color of variable text lables
There are no variables so highly correlated that they need to be removed, so collinearity should not be an issue. The relationships among variables seem sensible. For example, Number of Projects, Average Monthly Hours, and Last Evaluation all correlate more strongly with one another, which makes sense in a work environment. We also see a negative correlation between satisfaction level and left (-0.39), which we would expect.
#CREATE A TABLE OF MEANS FOR NUMERIC VARIABLE, GROUPED BY OUTCOME VARIABLE
Data %>%
group_by(left) %>%
summarise(
count = n(),
mean_sat = mean(satisfaction_level),
mean_eval = mean(last_evaluation),
mean_proj = mean(number_project),
mean_hrs = mean(average_monthly_hours),
mean_yrs = mean(time_spend_company)
)
group_by: one grouping variable (left)
summarise: now 2 rows and 7 columns, ungrouped
Grouping by our outcome variable, left, we can see that the mean for many of our predictor variables does not vary by much. The largest difference is found in satisfaction_level, with those who left showing a lower satisfaction level, on average.
#SATISFACTION BY SALARY LEVEL
Data %>%
group_by(salary) %>%
summarise(
n = n(),
mean = mean(satisfaction_level),
sd = sd(satisfaction_level)
)
group_by: one grouping variable (salary)
summarise: now 3 rows and 4 columns, ungrouped
Average Satisfaction level does not vary much by salary level, although it does get slightly higher at each level.
#UNDERSTANDING COMBINATIONS OF DATA
Data %>%
tabyl(left, salary) %>%
adorn_totals(c('row', 'col')) %>%
adorn_percentages('col') %>%
adorn_pct_formatting(digits = 0) %>%
adorn_ns() %>%
adorn_title('combined')
left/salary low medium high Total
0 71% (3453) 79% (3413) 92% (769) 76% (7635)
1 29% (1413) 21% (888) 8% (63) 24% (2364)
Total 100% (4866) 100% (4301) 100% (832) 100% (9999)
Departures by salary level follow the overall dataset balance of approximately 75% stay / 25% leave at the Low and Medium levels.
Exception: For those with high salary, only 8% left the company.
Data %>%
tabyl(left, department) %>%
adorn_totals(c('row', 'col')) %>%
adorn_percentages('col') %>%
adorn_pct_formatting(digits = 0) %>%
adorn_ns() %>%
adorn_title('combined')
left/department accounting hr IT management marketing product_mng RandD sales
0 74% (382) 72% (355) 78% (637) 85% (360) 77% (459) 77% (454) 85% (448) 75% (2079)
1 26% (134) 28% (138) 22% (179) 15% (66) 23% (140) 23% (135) 15% (76) 25% (675)
Total 100% (516) 100% (493) 100% (816) 100% (426) 100% (599) 100% (589) 100% (524) 100% (2754)
support technical Total
75% (1128) 75% (1333) 76% (7635)
25% (371) 25% (450) 24% (2364)
100% (1499) 100% (1783) 100% (9999)
We see a relatively even split of departures in each department, tending toward 75% stay / 25% go, which matches our overall dataset balance.
The only exception is Management (85% stay). (Also, our Random Variable has an 85% / 15% split)
Data %>%
tabyl(left, number_project) %>%
adorn_totals(c('row', 'col')) %>%
adorn_percentages('col') %>%
adorn_pct_formatting(digits = 0) %>%
adorn_ns() %>%
adorn_title('combined')
left/number_project 2 3 4 5 6 7 Total
0 35% (552) 98% (2695) 91% (2583) 78% (1452) 44% (353) 0% (0) 76% (7635)
1 65% (1041) 2% (46) 9% (254) 22% (401) 56% (443) 100% (179) 24% (2364)
Total 100% (1593) 100% (2741) 100% (2837) 100% (1853) 100% (796) 100% (179) 100% (9999)
The relationship of the number_project variable to the outcome variable differs from the others. There seems to be a “sweet spot” of 3 or 4 projects having less attrition. Those with 7 projects attrited at 100%.

Data %>%
tabyl(left, promotion_last_5years) %>%
adorn_totals(c('row', 'col')) %>%
adorn_percentages('col') %>%
adorn_pct_formatting(digits = 0) %>%
adorn_ns() %>%
adorn_title('combined')
left/promotion_last_5years 0 1 Total
0 76% (7439) 94% (196) 76% (7635)
1 24% (2352) 6% (12) 24% (2364)
Total 100% (9791) 100% (208) 100% (9999)
Data %>%
tabyl(left, Work_accident) %>%
adorn_totals(c('row', 'col')) %>%
adorn_percentages('col') %>%
adorn_pct_formatting(digits = 0) %>%
adorn_ns() %>%
adorn_title('combined')
left/Work_accident 0 1 Total
0 74% (6270) 93% (1365) 76% (7635)
1 26% (2258) 7% (106) 24% (2364)
Total 100% (8528) 100% (1471) 100% (9999)
Data %>%
tabyl(left, time_spend_company) %>%
adorn_totals(c('row', 'col')) %>%
adorn_percentages('col') %>%
adorn_pct_formatting(digits = 0) %>%
adorn_ns() %>%
adorn_title('combined')
left/time_spend_company 10 2 3 4 5 6 7
0 100% (156) 98% (2122) 75% (3205) 66% (1118) 44% (441) 72% (350) 100% (119)
1 0% (0) 2% (36) 25% (1058) 34% (584) 56% (551) 28% (135) 0% (0)
Total 100% (156) 100% (2158) 100% (4263) 100% (1702) 100% (992) 100% (485) 100% (119)
8 Total
100% (124) 76% (7635)
0% (0) 24% (2364)
100% (124) 100% (9999)
Data %>%
group_by(left) %>%
summarise(
n = n(),
mean = mean(satisfaction_level),
sd = sd(satisfaction_level)
)
group_by: one grouping variable (left)
summarise: now 2 rows and 4 columns, ungrouped
Data %>%
group_by(left) %>%
summarise(
n = n(),
mean = mean(last_evaluation),
sd = sd(last_evaluation)
)
group_by: one grouping variable (left)
summarise: now 2 rows and 4 columns, ungrouped
Data %>%
group_by(left) %>%
summarise(
n = n(),
mean = mean(average_monthly_hours),
sd = sd(average_monthly_hours)
)
group_by: one grouping variable (left)
summarise: now 2 rows and 4 columns, ungrouped
After examining the crosstab tables of various combinations of our data, we’ll also run some visualizations to see if anything else may pop out that is surprising or interesting for our modelling purposes.
#library(CGPfunctions)
#https://www.youtube.com/watch?v=BoAvAuOYv3s
#VISUALIZING
PlotXTabs2(Data, department, time_spend_company, results.subtitle = FALSE)
PlotXTabs2(Data, department, number_project, results.subtitle = FALSE)
PlotXTabs2(Data, department, time_spend_company, results.subtitle = FALSE)
PlotXTabs(Data, department, salary)
Plotted dataset Data variables department by salary
PlotXTabs(Data, number_project, salary)
Plotted dataset Data variables number_project by salary
PlotXTabs(Data, left, department)
Plotted dataset Data variables left by department
PlotXTabs(Data, left, salary)
Plotted dataset Data variables left by salary
PlotXTabs(Data, left, number_project)
Plotted dataset Data variables left by number_project
PlotXTabs(Data, left, time_spend_company)
Plotted dataset Data variables left by time_spend_company
In viewing various combinations of our variables, the relationships seem to make sense for the data set explored. For example, average satisfaction level decreases for those who left the company and increases slightly for each level of salary. Average monthly hours is slightly higher for people who left the company.
On the whole, our dataset has an approximately 75/25 split of retention and attrition. In most cases, this attrition percentage split stayed constant when accounting for various facets of the data. A few exceptions to the ~25% attrition rate were noted:
* salary is High = 8% attrition
* department is Management = 15% attrition
* number_projects is…
  – 7 projects = 100% attrition (high)
  – 6 projects = 56% attrition (high)
  – 5 projects = 22% attrition (expected)
  – 4 projects = 9% attrition (low)
  – 3 projects = 2% attrition (low)
  – 2 projects = 65% attrition (high)
* promoted_last_5yrs is Yes = 6% attrition
* Work_accident is Yes = 7% attrition
* time_spend_company is >7 years = 0% attrition

# ATTRITION PERCENTAGE IN DATASET
Data %>%
tabyl(left) %>%
adorn_pct_formatting(digits = 0, affix_sign = TRUE)
left n percent
0 7635 76%
1 2364 24%
Taking a look at our outcome variable, we again find that our data is unbalanced: about one quarter of the observations left the organization and three quarters stayed. We'll need to account for this in our recipes.
First, we'll make our outcome variable a factor for easier handling outside of recipes.
Data <- Data %>%
mutate(left = as.factor(left))
mutate: converted 'left' from double to factor (0 new NA)
Then, we’ll set seed and split the data. We’re using a default 75/25 split, as we have a nice large amount of data and this should leave enough data in each set.
set.seed(1013)
#SPLIT DATA WITH CONSIDERATION OF OUTCOME VARIABLE `LEFT`
data_split <- initial_split(Data, prop = 0.75, strata = "left")
#CREATE TRAINING & TEST DATA SETS
train_data <- training(data_split)
test_data <- testing(data_split)
#DOUBLE CHECK OUTCOME SPLIT IN TRAINING & TEST DATA SETS
tabyl(train_data$left)
train_data$left n percent
0 5726 0.7635685
1 1773 0.2364315
tabyl(test_data$left)
test_data$left n percent
0 1909 0.7636
1 591 0.2364
Training and test data sets were created with a 75 / 25 split of the original observations. Due to the imbalance, we specified our outcome variable (left) as the stratum when splitting, to ensure the training and test sets had approximately equal proportions of retention and attrition in the outcome variable. Checking the resulting sets shows we successfully held the 76%/24% ratio of left No/Yes observations.
This imbalance in the training set will be accounted for in model pre-processing.
Now to go ahead and create our splits to use in modeling later.
set.seed(1013)
cv_folds <- vfold_cv(train_data, strata = left, v = 10)
map_dbl(cv_folds$splits,
function(x) {
dat <- as.data.frame(x)$left
mean(dat == "1") #left is coded "0"/"1"; comparing against "Yes" matches no level and returns all zeros
})

Each fold should return approximately the overall 24% attrition rate, confirming the stratification.
To achieve our final goal of finding a predictive model for attrition that can be used to quantify the ROI of an intervention, we’ll try several models. Our goal is both the most predictive model, but also one that is interpretable. We’ll start with a more complex model to see how predictive it is and then run a simpler model to compare effectiveness.
First we’ll prep several model specifications.
#CREATING MODELS FOR SCREENING
#Helpful model choice approach chart: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
#Logistic Regression
#SIMPLEST
spec_logistic <-
logistic_reg(penalty = tune(), mixture = tune()) %>% #NB: penalty/mixture are glmnet arguments; the glm engine ignores them
set_engine("glm") %>%
set_mode("classification")
#Random Forest - Decision Tree Method
#MORE COMPLEX
spec_rf <-
rand_forest(mtry = 5, trees = 10000, min_n = 4) %>% #fixed values; could replace with tune() for a search
set_engine("randomForest", importance = TRUE) %>% #alternative: set_engine("ranger", importance = "impurity")
set_mode("classification") %>%
parsnip::translate()
#Neural Net
#MORE COMPLEX
spec_nnet <-
mlp() %>%
set_engine("keras", verbose = 0) %>%
set_mode("classification")
#Support Vector Machine (LinearSV Classification)
#MORE COMPLEX
spec_svm_linear2 <-
svm_linear(cost = 1) %>%
set_engine("kernlab") %>%
set_mode("classification")
#SOME OTHER MODEL IDEAS
# dt_loan <- decision_tree(cost_complexity = tune(), tree_depth = tune(), min_n = tune()) %>%
# set_engine("rpart") %>%
# set_mode("classification")
#
# xgboost_loan <- boost_tree(mtry = tune(), trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), sample_size = tune()) %>%
# set_engine("xgboost") %>%
# set_mode("classification")
#
# #K-Nearest Neighbors - This is a type of SVM
# knn_loan <- nearest_neighbor(neighbors = tune(), weight_func = tune(), dist_power = tune()) %>%
# set_engine("kknn") %>%
# set_mode("classification")
# spec_rf <-
# rand_forest(mtry = tune(), trees = 1000, min_n = tune()) %>% #Replace with trees = #tune()
# set_engine("randomForest", importance = TRUE) %>%
# set_mode("classification")
# spec_svm_linear <-
# svm_poly(degree = 1) %>%
# set_engine("kernlab") %>%
# set_mode("classification")
Then we'll create recipes to handle preprocessing of our data. Given the imbalance in the outcome variable, we'll add steps to correct for it in the training data. We'll also center and scale (normalize) the numeric predictors, since we saw a good bit of non-normality and bimodal distributions in the predictor variables.
#CREATE FOUR RECIPES
#"Some models (notably neural networks, K-nearest neighbors, and support vector machines) require predictors that have been centered and scaled" - https://www.tmwr.org/workflow-sets.html
#PREPROCESSING RULES FROM KUHN, p550
#LOGISTIC REGRESSION - CS, NZV, REMOVE CORR VARS
#NEURAL NET - CS, NZV, REMOVE CORR VARS
#SVM - CS
#RANDOM FOREST - N/A
set.seed(1013)
conflict_prefer("step_upsample", "themis")
[conflicted] Removing existing preference
[conflicted] Will prefer themis::step_upsample over any other package
#RECIPE 1 - Centered & Scaled-BoxCox; Upsample; NZV
recipe_1 <- recipe(left ~., data = train_data) %>%
#update_role(ID, new_role = "id") %>%
step_BoxCox(all_numeric_predictors())%>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_nzv(all_predictors()) %>%
step_upsample(left, skip = TRUE)
#RECIPE 2 - Centered & Scaled; Upsample; NZV
recipe_2 <- recipe(left ~., data = train_data) %>%
#update_role(ID, new_role = "id") %>%
step_center(all_numeric_predictors()) %>%
step_scale(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_nzv(all_predictors()) %>%
step_upsample(left, skip = TRUE)
#RECIPE 3 - Not Centered & Scaled; Upsample
recipe_3 <- recipe(left ~., data = train_data) %>%
# update_role(ID, new_role = "id") %>% ## Per Julia Silge
step_dummy(all_nominal_predictors()) %>%
step_upsample(left, skip = TRUE)
#RECIPE 4 - Not Centered & Scaled; Upsample; NZV
recipe_4 <- recipe(left ~., data = train_data) %>%
#update_role(ID, new_role = "id") %>%
step_dummy(all_nominal_predictors()) %>%
step_nzv(all_predictors()) %>%
step_upsample(left, skip = TRUE)
We’ll start by running a complicated model to set “ceiling potential” for prediction accuracy.
svm_workflow <- workflow() %>%
add_recipe(recipe_2) %>%
add_model(spec_svm_linear2)
svm_workflow
== Workflow ===============================================================================================
Preprocessor: Recipe
Model: svm_linear()
-- Preprocessor -------------------------------------------------------------------------------------------
5 Recipe Steps
* step_center()
* step_scale()
* step_dummy()
* step_nzv()
* step_upsample()
-- Model --------------------------------------------------------------------------------------------------
Linear Support Vector Machine Specification (classification)
Main Arguments:
cost = 1
Computational engine: kernlab
#RESAMPLING METRICS WITHOUT TUNING
conflict_prefer("alpha", "scales")
[conflicted] Will prefer scales::alpha over any other package
svm_res <-
svm_workflow %>%
fit_resamples(
resamples = cv_folds,
metrics = metric_set(
recall, precision, f_meas,
accuracy, kap,
roc_auc, sens, spec),
control = control_resamples(
save_pred = TRUE)
)
svm_res
# Resampling results
# 10-fold cross-validation using stratification
#FIT THE MODEL WITH TRAINING DATA
svm_fit <- svm_workflow %>%
fit(data = train_data)
Setting default kernel parameters
svm_fit
== Workflow [trained] =====================================================================================
Preprocessor: Recipe
Model: svm_linear()
-- Preprocessor -------------------------------------------------------------------------------------------
5 Recipe Steps
* step_center()
* step_scale()
* step_dummy()
* step_nzv()
* step_upsample()
-- Model --------------------------------------------------------------------------------------------------
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 1
Linear (vanilla) kernel function.
Number of Support Vectors : 6539
Objective Function Value : -6516.875
Training error : 0.21219
Probability model included.
#TUNE THE SVM COST PARAMETER
svm_tune_wf <- workflow() %>%
add_recipe(recipe_2) %>%
add_model(svm_linear(cost = tune()) %>%
set_engine("kernlab") %>%
set_mode("classification"))
param_grid <- grid_regular(cost(), levels = 10)
tune_res <- tune_grid(
svm_tune_wf,
resamples = cv_folds,
grid = param_grid
)
x Fold04: preprocessor 1/1, model 5/10 (predictions): Error: $ operator is invalid for atomic vectors
x Fold09: preprocessor 1/1, model 4/10 (predictions): Error: $ operator is invalid for atomic vectors
autoplot(tune_res)
#Plugging in best value for cost for final model fit
best_cost_svm <- select_best(tune_res, metric = "roc_auc")
svm_linear_final <- finalize_workflow(svm_workflow, best_cost_svm)
svm_linear_fit <- svm_linear_final %>% fit(train_data)
Setting default kernel parameters
svm_linear_fit
== Workflow [trained] =====================================================================================
Preprocessor: Recipe
Model: svm_linear()
-- Preprocessor -------------------------------------------------------------------------------------------
5 Recipe Steps
* step_center()
* step_scale()
* step_dummy()
* step_nzv()
* step_upsample()
-- Model --------------------------------------------------------------------------------------------------
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 1
Linear (vanilla) kernel function.
Number of Support Vectors : 6539
Objective Function Value : -6516.875
Training error : 0.21219
Probability model included.
#CONFUSION MATRIX WITH TUNED COST
augment(svm_linear_fit, new_data = test_data) %>%
conf_mat(truth = left, estimate = .pred_class)
Truth
Prediction 0 1
0 1349 83
1 560 508
# Truth
# Prediction 0 1
# 0 1791 434
# 1 118 157
RESULTS
* 54.0% (n=1349) Prediction was No Attrition, Truth was No Attrition
* 20.3% (n=508) Prediction was Attrition, Truth was Attrition
* 74.3% Accurate prediction
* 22.4% (n=560) Prediction was Attrition, Truth was No Attrition
* 3.3% (n=83) Prediction was No Attrition, Truth was Attrition
* 25.7% Inaccurate prediction
In an ROI scenario, this model's high false-positive rate (22.4% of the test set) would inflate estimated intervention costs, since many flagged employees would not actually have left.
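Recomputing the headline rates by hand from the tuned-SVM confusion matrix above (rows = prediction, columns = truth) as a sanity check:

```r
# Manual accuracy / error-rate check from the confusion matrix counts above.
cm <- matrix(c(1349, 560, 83, 508), nrow = 2,
             dimnames = list(pred = c("0", "1"), truth = c("0", "1")))
n <- sum(cm)                                    # 2500 test observations
accuracy <- (cm["0", "0"] + cm["1", "1"]) / n   # (1349 + 508) / 2500 = 0.743
fp_rate  <- cm["1", "0"] / n                    # flagged but stayed: 0.224
fn_rate  <- cm["0", "1"] / n                    # missed leavers:     0.033
round(c(accuracy = accuracy, fp = fp_rate, fn = fn_rate), 3)
```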
augment(svm_linear_fit, new_data = test_data) %>%
conf_mat(truth = left, estimate = .pred_class) %>%
summary()
After running a more complicated SVM model to see how good it could be, we’ll run a simple Logistic Regression to compare and contrast.
#LOGISTIC REGRESSION WORKFLOW
logit_wflow <- workflow() %>%
add_recipe(recipe_2) %>%
add_model(spec_logistic)
logit_wflow
== Workflow ===============================================================================================
Preprocessor: Recipe
Model: logistic_reg()
-- Preprocessor -------------------------------------------------------------------------------------------
5 Recipe Steps
* step_center()
* step_scale()
* step_dummy()
* step_nzv()
* step_upsample()
-- Model --------------------------------------------------------------------------------------------------
Logistic Regression Model Specification (classification)
Main Arguments:
penalty = tune()
mixture = tune()
Computational engine: glm
#RESAMPLING METRICS WITHOUT TUNING
log_res <-
logit_wflow %>%
fit_resamples(
resamples = cv_folds,
metrics = metric_set(
recall, precision, f_meas,
accuracy, kap,
roc_auc, sens, spec),
control = control_resamples(
save_pred = TRUE)
)
log_res
# Resampling results
# 10-fold cross-validation using stratification
# save model coefficients for a fitted model object from a workflow
get_model <- function(x) {
extract_fit_parsnip(x) %>% tidy()   #pull_workflow_fit() is deprecated as of workflows 0.2.3
}
# same as before with one exception
log_res_2 <-
logit_wflow %>%
fit_resamples(
resamples = cv_folds,
metrics = metric_set(
recall, precision, f_meas,
accuracy, kap,
roc_auc, sens, spec),
control = control_resamples(
save_pred = TRUE,
extract = get_model) # use extract and our new function
)
! Fold01: internal: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
Please use `extract_fit_parsnip(...
all_coef <- map_dfr(log_res_2$.extracts, ~ .x[[1]][[1]]) #Results flattened and collected
log_res_2$.extracts[[1]][[1]] #Show results
[[1]]
NA
set.seed(1013)
#CROSS VALIDATE LOGIT_WFLOW ON CV FOLDS
#the workflow already carries the model spec, so it should not be passed again
fit_resamples(
logit_wflow,
resamples = cv_folds
)
# Resampling results
# 10-fold cross-validation using stratification
#PRINT RESULTS & SAVE AS OBJECT
set.seed(1013)
logit_tune_results <- fit_resamples(logit_wflow,
resamples = cv_folds) %>%
collect_metrics()
logit_last_fit <- logit_wflow %>%
# fit on the training set and evaluate on the test set
last_fit(data_split)
logit_last_fit
# Resampling results
# Manual resampling
logit_test_performance <- logit_last_fit %>% collect_metrics()
logit_test_performance
Overall the performance is not great. Similar to our training data, we see accuracy of .74 and AUC of .81. That accuracy actually sits slightly below the 76% majority-class baseline, though an AUC of .81 is well above chance. This may still be valuable for our purposes, depending on where the accuracy is coming from.
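Why accuracy alone is weak evidence here: a "model" that predicts "stays" for every employee hits roughly 76% on our test mix without any signal at all, which is why we lean on AUC and the confusion matrix.

```r
# Majority-class baseline: predict 0 ("stays") for everyone in the test set.
truth <- rep(c(0, 1), times = c(1909, 591))   # test-set outcome counts from above
baseline_pred <- rep(0, length(truth))        # no-signal majority prediction
mean(baseline_pred == truth)                  # 0.7636 - the bar any model must clear
```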
#Generate test predictions from the test set.
logit_test_predictions <- logit_last_fit %>% collect_predictions()
logit_test_predictions
#Plot the ROC Curve
logit_test_predictions %>%
roc_curve(left, .pred_0) %>% #Originally, was "pred_Yes" and curve was inverted
ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_line(size = 1.5, color = "midnightblue") +
geom_abline(
lty = 2, alpha = 0.5,
color = "gray50",
size = 1.2
)
# generate a confusion matrix
conflict_prefer("spec", "yardstick")
[conflicted] Removing existing preference
[conflicted] Will prefer yardstick::spec over any other package
logit_test_predictions %>%
conf_mat(truth = left, estimate = .pred_class)
Truth
Prediction 0 1
0 1389 124
1 520 467
RESULTS
55.5% (n=1389) Prediction was No Attrition, Truth was No Attrition
18.7% (n=467) Prediction was Attrition, Truth was Attrition
74.2% Accurate prediction
20.8% (n=520) Prediction was Attrition, Truth was No Attrition
5.0% (n=124) Prediction was No Attrition, Truth was Attrition
25.8% Inaccurate prediction
This is slightly worse than our more complicated SVM model in overall prediction accuracy, but not by much. The two trade off error types: the logistic model produced fewer false positives (520 vs. 560), while the SVM run above produced fewer false negatives (83 vs. 124). False negatives matter most when predicting attrition, given we will want to be sure our proposed intervention ROI is not overestimated.
logit_test_predictions %>%
ggplot() +
geom_density(aes(x = .pred_1,
fill = left),
alpha = 0.5)
logit_test_predictions %>%
conf_mat(truth = left, estimate = .pred_class) %>%
summary()
#LIST IMPORTANT VARIABLES
logit_final_model <- fit(logit_wflow, Data)
logit_final_model
== Workflow [trained] =====================================================================================
Preprocessor: Recipe
Model: logistic_reg()
-- Preprocessor -------------------------------------------------------------------------------------------
5 Recipe Steps
* step_center()
* step_scale()
* step_dummy()
* step_nzv()
* step_upsample()
-- Model --------------------------------------------------------------------------------------------------
Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
Coefficients:
(Intercept) satisfaction_level last_evaluation number_project
-0.14259 -1.10614 0.21005 -0.48835
average_monthly_hours time_spend_company Work_accident department_IT
0.21479 0.58968 -0.56055 -0.23031
department_marketing department_product_mng department_RandD department_sales
-0.02386 0.02850 -0.60682 -0.01250
department_support department_technical salary_medium salary_high
0.06263 0.23822 -0.43203 -1.72928
Degrees of Freedom: 15269 Total (i.e. Null); 15254 Residual
Null Deviance: 21170
Residual Deviance: 16040 AIC: 16070
rf_workflow <- workflow() %>%
add_recipe(recipe_3) %>%
add_model(spec_rf)
rf_workflow
== Workflow ===============================================================================================
Preprocessor: Recipe
Model: rand_forest()
-- Preprocessor -------------------------------------------------------------------------------------------
2 Recipe Steps
* step_dummy()
* step_upsample()
-- Model --------------------------------------------------------------------------------------------------
Random Forest Model Specification (classification)
Main Arguments:
mtry = 5
trees = 10000
min_n = 4
Engine-Specific Arguments:
importance = TRUE
Computational engine: randomForest
Model fit template:
randomForest::randomForest(x = missing_arg(), y = missing_arg(),
mtry = min_cols(~5, x), ntree = 10000, nodesize = min_rows(~4,
x), importance = TRUE)
set.seed(1013)
rf_fit <- fit(rf_workflow, data = train_data)
rf_fit
== Workflow [trained] =====================================================================================
Preprocessor: Recipe
Model: rand_forest()
-- Preprocessor -------------------------------------------------------------------------------------------
2 Recipe Steps
* step_dummy()
* step_upsample()
-- Model --------------------------------------------------------------------------------------------------
Call:
randomForest(x = maybe_data_frame(x), y = y, ntree = ~10000, mtry = min_cols(~5, x), nodesize = min_rows(~4, x), importance = ~TRUE)
Type of random forest: classification
Number of trees: 10000
No. of variables tried at each split: 5
OOB estimate of error rate: 0.46%
Confusion matrix:
0 1 class.error
0 5713 13 0.002270346
1 40 5686 0.006985679
#USING RANGER ENGINE, 1000 TREES. SAME RESULT WITH 10k TREES AND RANGER ENGINE
#WITH 1000 TREES
# Warning: tune columns were requested but there were 16 predictors in the data. 16 will be used.
# Warning: tune samples were requested but there were 11452 rows in the data. 11452 will be used.
# Call:
# randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, mtry = min_cols(~tune(), x), nodesize = min_rows(~tune(), x), importance = ~TRUE)
# Type of random forest: classification
# Number of trees: 1000
# No. of variables tried at each split: 16
#
# OOB estimate of error rate: 21.73%
# Confusion matrix:
# 0 1 class.error
# 0 4899 827 0.1444289
# 1 1662 4064 0.2902550
#WITH 10000 TREES
# Ranger result
#
# Call:
# ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~tune(), x), num.trees = ~10000, min.node.size = min_rows(~tune(), x), importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
#
# Type: Probability estimation
# Number of trees: 10000
# Sample size: 11452
# Number of independent variables: 18
# Mtry: 18
# Target node size: 11452
# Variable importance mode: impurity
# Splitrule: gini
# OOB prediction error (Brier s.): 0.2500432
Pulling the confusion matrix out of the output:
OOB estimate of error rate: 0.46%
Confusion matrix:
     0    1  class.error
0 5713   13  0.002270346
1   40 5686  0.006985679
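Recomputing the OOB error rate from that confusion matrix, just to confirm where the 0.46% comes from: misclassified out-of-bag votes over all (upsampled) training rows.

```r
# OOB error = off-diagonal counts / total rows, from the matrix printed above.
cm <- matrix(c(5713, 40, 13, 5686), nrow = 2)   # rows = truth, cols = prediction
(cm[1, 2] + cm[2, 1]) / sum(cm)                 # (13 + 40) / 11452 = 0.0046, i.e. 0.46%
```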
#ROUGH ERROR CHECK - as.numeric() on a factor returns the level codes (1/2), so this "rmse" is only a crude gauge, not a proper classification metric
augment(rf_fit, new_data = test_data) %>%
rmse(truth = as.numeric(left), estimate = as.numeric(.pred_class))
#FIT randomForest DIRECTLY FOR ROC & VARIABLE IMPORTANCE
#NB: trees/min_n are parsnip argument names; randomForest() silently ignores them and falls back to its default ntree = 500, which is why the printout further down reports 500 trees
rf_final <- randomForest(left ~ ., data = train_data, mtry = 5, trees = 10000, min_n = 4, keep.forest = TRUE, importance = TRUE)
rf.roc <- roc(train_data$left, rf_final$votes[,2])
Setting levels: control = 0, case = 1
Setting direction: controls < cases
plot(rf.roc)
pROC::auc(rf.roc)
Area under the curve: 0.9889
rf_testing_pred <- augment(rf_fit, new_data = test_data) # test set predictions
rf_testing_pred %>%
roc_auc(truth = left, .pred_0)   #class "0" is the first factor level, so its probability column drives roc_auc
rf_testing_pred %>%
accuracy(truth = left, .pred_class)
last_fit_rf <- last_fit(rf_workflow,
split = data_split,
metrics = metric_set(
recall, precision, f_meas,
accuracy, kap,
roc_auc, sens, spec)
)
#Show performance metrics
last_fit_rf %>%
collect_metrics()
rf_final
Call:
randomForest(formula = left ~ ., data = train_data, mtry = 5, trees = 10000, min_n = 4, keep.forest = TRUE, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5
OOB estimate of error rate: 1.19%
Confusion matrix:
0 1 class.error
0 5717 9 0.001571778
1 80 1693 0.045121263
importance(rf_final)   #importance() has no "sort" argument; the table prints in variable order
0 1 MeanDecreaseAccuracy MeanDecreaseGini
satisfaction_level 101.400344 337.898530 319.888350 1124.759934
last_evaluation 18.640044 111.035368 113.710098 302.200133
number_project 69.694218 239.160597 235.333583 445.560219
average_monthly_hours 62.586243 101.416338 111.910062 328.637547
time_spend_company 57.939171 82.418274 91.668575 459.621633
Work_accident 4.319829 8.684521 9.092943 4.591932
promotion_last_5years 2.830797 5.160545 5.064632 1.076596
department 7.580276 48.293374 29.060656 29.775646
salary 7.210778 27.537577 19.853285 13.094145
Satisfaction level and number of projects are the strongest predictors in this model by both MeanDecreaseAccuracy and MeanDecreaseGini.
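Scaling the MeanDecreaseGini column above into shares makes the ranking easier to read (values copied from the importance table):

```r
# Gini importances as shares of the total, sorted.
gini <- c(satisfaction_level = 1124.76, number_project = 445.56,
          time_spend_company = 459.62, average_monthly_hours = 328.64,
          last_evaluation = 302.20, department = 29.78, salary = 13.09,
          Work_accident = 4.59, promotion_last_5years = 1.08)
round(sort(gini / sum(gini), decreasing = TRUE), 3)   # satisfaction_level alone carries ~41%
```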
varUsed(rf_final, count = TRUE, by.tree = FALSE)
#NEURAL NET WORKFLOW
nn_workflow <- workflow() %>%
add_recipe(recipe_1) %>%
add_model(spec_nnet)
nn_workflow
set.seed(1013)
nnet_res <-
nn_workflow %>%
fit_resamples(
resamples = cv_folds,
metrics = metric_set(
recall, precision, f_meas,
accuracy, kap,
roc_auc, sens, spec),
control = control_resamples(save_pred = TRUE)
)
nnet_res
library(NeuralNetTools)
#https://fawda123.github.io/NeuralNetTools/articles/Overview.html
nn_model <- nnet::nnet(left ~ ., data = train_data, size = 10)
par(mar = numeric(4))
plotnet(nn_model)
# The plotnet function plots a neural network as a simple network or as a neural interpretation diagram (NID). The default settings are to plot as NID with positive weights between layers as black lines and negative weights as grey lines. Line thickness is in proportion to relative magnitude of each weight. The first layer includes only input variables with nodes labelled as I1 through In for n input variables. One through many hidden layers are plotted with each node in each layer labelled as H1 through Hn. The output layer is plotted last with nodes labeled as O1 through On. Bias nodes connected to the hidden and output layers are also shown.
garson(nn_model)
olden(nn_model)
# The garson function uses Garson’s algorithm to evaluate relative variable importance. This function identifies the relative importance of explanatory variables for a single response variable by deconstructing the model weights. The importance of each variable can be determined by identifying all weighted connections between the layers in the network. That is, all weights connecting the specific input node that pass through the hidden layer to the response variable are identified. This is repeated for all other explanatory variables until a list of all weights that are specific to each input variable is obtained. The connections are tallied for each input node and scaled relative to all other inputs. A single value is obtained for each explanatory variable that describes the relationship with the response variable in the model. The results indicate relative importance as the absolute magnitude from zero to one. The function cannot be used to evaluate the direction of the response. Only neural networks with one hidden layer and one output node can be evaluated.
# The olden function is an alternative and more flexible approach to evaluate variable importance. The function calculates importance as the product of the raw input-hidden and hidden-output connection weights between each input and output neuron and sums the product across all hidden neurons. An advantage of this approach is the relative contributions of each connection weight are maintained in terms of both magnitude and sign as compared to Garson’s algorithm which only considers the absolute magnitude. For example, connection weights that change sign (e.g., positive to negative) between the input-hidden to hidden-output layers would have a cancelling effect whereas Garson’s algorithm may provide misleading results based on the absolute magnitude. An additional advantage is that Olden’s algorithm is capable of evaluating neural networks with multiple hidden layers and response variables. The importance values assigned to each variable are in units that are based directly on the summed product of the connection weights. The actual values should only be interpreted based on relative sign and magnitude between explanatory variables. Comparisons between different models should not be made.
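A toy Garson-style calculation for a 2-input, 2-hidden, 1-output network makes the algorithm described above concrete. The weights here are invented purely for illustration: per input, sum the absolute products of its input-to-hidden and hidden-to-output weights, then scale to shares.

```r
# Garson-style importance on made-up weights (illustration only).
w_ih <- matrix(c(0.8, -0.4,    # input 1 -> hidden 1, hidden 2
                 0.1,  0.6),   # input 2 -> hidden 1, hidden 2
               nrow = 2, byrow = TRUE)
w_ho <- c(0.5, -0.7)           # hidden 1, hidden 2 -> output
contrib <- abs(w_ih) %*% abs(w_ho)    # summed |w_ih * w_ho| per input
round(contrib / sum(contrib), 3)      # relative importance, sums to 1
```

Olden's variant would drop the abs(), so sign flips between layers can cancel rather than accumulate.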
last_fit_nn <- last_fit(nn_workflow,
split = data_split,
metrics = metric_set(
recall, precision, f_meas,
accuracy, kap,
roc_auc, sens, spec)
)
#Show performance metrics
last_fit_nn %>%
collect_metrics()
These are our final performance metrics. The model performs well on both the training and test data. We have some insight into the variables driving our model, including Average Monthly Hours and Number of Projects.
last_fit_nn %>%
collect_predictions() %>%
roc_curve(left, .pred_0) %>%
autoplot()
last_fit_nn %>%
collect_predictions() %>%
conf_mat(truth = left, estimate = .pred_class)
RESULTS
* 71.6% (n=1791) Prediction was No Attrition, Truth was No Attrition
* 21.8% (n=545) Prediction was Attrition, Truth was Attrition
* 93.4% Accurate prediction
This accuracy is much better than either the SVM or Logistic Regression models across the board. However, the model may be too difficult to interpret to be useful.
#https://www.kirenz.com/post/2021-02-17-r-classification-tidymodels/#last-evaluation-on-test-set
log_metrics <-
log_res %>%
collect_metrics(summarise = TRUE) %>%
mutate(model = "Logistic Regression")
# rf_metrics <-
#   rf_res %>%
#   collect_metrics(summarise = TRUE) %>%
#   mutate(model = "Random Forest")
#A Windows error / known package bug prevented collecting resampling metrics for the Random Forest model, so it is excluded from the comparison below
svm_metrics <-
svm_res %>%
collect_metrics(summarise = TRUE) %>%
mutate(model = "Support Vector Classifier")
nnet_metrics <-
nnet_res %>%
collect_metrics(summarise = TRUE) %>%
mutate(model = "Neural Net")
# create dataframe with all models (Random Forest excluded - see note above)
model_compare <- bind_rows(
log_metrics,
svm_metrics,
nnet_metrics
)
# change data structure
model_comp <-
model_compare %>%
select(model, .metric, mean, std_err) %>%
pivot_wider(names_from = .metric, values_from = c(mean, std_err))
# show mean Accuracy-Score for every model
model_comp %>%
arrange(mean_accuracy) %>%
mutate(model = forcats::fct_reorder(model, mean_accuracy)) %>% # order results
ggplot(aes(model, mean_accuracy, fill=model)) +
geom_col() +
coord_flip() +
scale_fill_brewer(palette = "Blues") +
geom_text(
size = 3,
aes(label = round(mean_accuracy, 2), y = mean_accuracy + 0.08), #label with accuracy, not F1
vjust = 1
)
# show mean area under the curve (auc) per model
model_comp %>%
arrange(mean_roc_auc) %>%
mutate(model = forcats::fct_reorder(model, mean_roc_auc)) %>%
ggplot(aes(model, mean_roc_auc, fill=model)) +
geom_col() +
coord_flip() +
scale_fill_brewer(palette = "Blues") +
geom_text(
size = 3,
aes(label = round(mean_roc_auc, 2), y = mean_roc_auc + 0.08),
vjust = 1
)
Feature Coefficients
Unfortunately, a bug in the modelplotr package (the `new_data`/`newdata` error noted below) prevented generating the fancy business-value charts; the intended code is preserved in the comments below for a future rerun.
#FROM: https://www.kdnuggets.com/2019/06/modelplotr-cran-business-value-predictive-models.html
# set.seed(1013)
#
# #LOGIT
# logit_final <- logistic_reg(left ~ ., mode = "classification", engine = "glm")
#
# logit_final_fit <-
# logit_final %>%
# fit(left ~ ., data = train_data)
#
# #RF
# rf_final <- randomForest(left ~ ., data = train_data, mtry = 5, trees = 10000, min_n = 4, keep.forest = FALSE, importance = TRUE)
#
# # rf_final_fit <-
# # rf_final %>%
# # fit(left ~ ., data = train_data)
#
# #NN
# nn_final <- mlp(left ~ ., mode = "classification", engine = "keras")
#
# # nn_final_fit <-
# # nn_final %>%
# # fit(left ~ ., data = train_data)
#
# #SVM
# svm_final <- svm_linear(left ~ ., mode = "classification", engine = "kernlab", cost = 1)
#
# svm_final_fit <-
# svm_final %>%
# fit(left ~ ., data = train_data)
#library(modelplotr)
# transform datasets and model objects into scored data and calculate ntiles
# preparation steps
# scores_and_ntiles <- prepare_scores_and_ntiles(datasets = list("train_data","test_data"),
# dataset_labels = list("train data","test data"),
# models = list("logit_final_fit","rf_final_fit", "nn_final", "svm_final_fit"),
# model_labels = list("Logistic Regression","Random Forest", "Neural Net",
# "Support Vector Classifier"),
# target_column="left",
# ntiles=100)
#Error: Did you mean to use `new_data` instead of `newdata`? -- insurmountable bug
# transform data generated with prepare_scores_and_ntiles into aggregated data for chosen plotting scope
# plot_input_log <- plotting_scope(prepared_input = scores_and_ntiles,
# select_model_label = "Logistic Regression",
# select_dataset_label = "test data")
#
# plot_input_rf <- plotting_scope(prepared_input = scores_and_ntiles,
# select_model_label = "Random Forest",
# select_dataset_label = "test data")
#
# plot_input_nn <- plotting_scope(prepared_input = scores_and_ntiles,
# select_model_label = "Neural Net",
# select_dataset_label = "test data")
#
# plot_input_svm <- plotting_scope(prepared_input = scores_and_ntiles,
# select_model_label = "Support Vector Classifier",
# select_dataset_label = "test data")
# plot all four evaluation plots and save to file, highlight decile 2
# plot_multiplot(data = plot_input_log, highlight_ntile=2,
# save_fig = TRUE, save_fig_filename = 'Predictive Attrition Model')
#
# plot_multiplot(data = plot_input_rf, highlight_ntile=2,
# save_fig = TRUE, save_fig_filename = 'Predictive Attrition Model')
#
# plot_multiplot(data = plot_input_nn, highlight_ntile=2,
# save_fig = TRUE, save_fig_filename = 'Predictive Attrition Model')
#
# plot_multiplot(data = plot_input_svm, highlight_ntile=2,
# save_fig = TRUE, save_fig_filename = 'Predictive Attrition Model')
#Profit plot - automatically highlighting the ntile where profit is maximized!
# plot_profit(data = plot_input_log,
# fixed_costs = 100000,
# variable_costs_per_unit = 5000,
# profit_per_unit = 95000)
#
# plot_profit(data = plot_input_rf,
# fixed_costs = 100000,
# variable_costs_per_unit = 5000,
# profit_per_unit = 95000)
#
# plot_profit(data = plot_input_nn,
# fixed_costs = 100000,
# variable_costs_per_unit = 5000,
# profit_per_unit = 95000)
#
# plot_profit(data = plot_input_svm,
# fixed_costs = 100000,
# variable_costs_per_unit = 5000,
# profit_per_unit = 95000)